1 Resources for data visualization

Wickham et al. 2019. ggplot2

ggplot2 cheatsheet

Wilke CO. 2019. Introduction to cowplot

Wilke CO. 2019. Arranging plots in a grid

Tufte ER. 2001. The Visual Display of Quantitative Information

Wilke CO. 2019. Fundamentals of Data Visualization

Wilkinson L. 1999. The Grammar of Graphics

The R Graph Gallery

1.0.1 Plotting in R

There are two major sets of tools for creating plots in R:

    1. base graphics, which come with all R installations
    2. ggplot2, a stand-alone package.

Note that other plotting facilities do exist (notably lattice), but base and ggplot2 are by far the most popular. Check out this post on comparisons between base, lattice, and ggplot2 graphics to learn more.

1.0.2 Package installation

Install and load the following packages. Let’s get started!

install.packages(c("ggplot2", "cowplot", "dplyr"))

library(ggplot2)
library(cowplot)
library(dplyr)

1.0.3 The dataset

For the following examples, we will use the gapminder dataset. Gapminder is a country-year dataset with information on life expectancy, population, and GDP per capita.

gap = read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors = TRUE)
head(gap)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134
str(gap)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

2 R base graphics

Base graphics are nice for quick visualizations of your data. You can make them publication-quality, but doing so takes more effort than it does with ggplot2. Let’s review base plotting calls for histograms, boxplots, and scatterplots.

2.0.1 Histogram

Histograms are useful to illustrate the distribution of a single continuous (i.e., numeric or integer) variable.

hist(x = gap$lifeExp)

# Define number of breaks
hist(x = gap$lifeExp, breaks = 5)
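
A single number passed to breaks is treated by hist() as a suggestion, so the number of bars you get may differ from what you asked for. The sketch below uses made-up values (not the gapminder data) and illustrative variable names to compare a suggested count with an explicit vector of cut points.

```r
# Made-up life expectancy values, for illustration only
x <- c(28.8, 35.2, 41.7, 48.1, 52.3, 57.9, 61.4, 66.0, 70.5, 74.2, 78.8, 82.6)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = 5, plot = FALSE)
h$breaks   # the "pretty" cut points hist() actually chose
h$counts   # number of values in each bin

# Passing a vector of cut points forces exact bin edges
h_exact <- hist(x, breaks = seq(from = 20, to = 90, by = 14), plot = FALSE)
h_exact$breaks
```

If you need exact bin edges, the vector form is the more predictable choice.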

You can see the 657 stock colors available to you by typing colors(). Why do you think there are so many “greys”?
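
As a quick way to explore that question, the sketch below counts the built-in color names and picks out those spelled with either “gray” or “grey”. The variable names are illustrative.

```r
# All built-in color names known to base R
all_cols <- colors()
n_cols <- length(all_cols)

# Names containing either spelling, "gray" or "grey"
grey_cols <- grep("gr[ae]y", all_cols, value = TRUE)
n_grey <- length(grey_cols)

n_cols
n_grey
head(grey_cols)
```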

Change the color of the bars, the title, the x-axis label, and the x and y scale limits:

hist(x = gap$lifeExp, 
     breaks = 10, 
     col = "skyblue", 
     main = "Histogram of Life Expectancy",
     xlab = "Years",
     xlim = c(20, 90), 
     ylim = c(0, 350), 
     las = 1)

2.0.2 Boxplot

Boxplots are useful for visualizing the distribution of a single continuous variable, optionally split by the levels of a factor. For example, we can look at distributions of life expectancy by continent:

boxplot(gap$lifeExp ~ gap$continent, 
        # Give each box its own color
        col = c("orange", "blue", "green", "red", "purple"))

# There are five continents represented in this dataset
levels(gap$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
length(levels(gap$continent))
## [1] 5

2.0.3 Scatterplot

Scatterplots are useful for visualizing the relationship between two continuous (i.e., numeric or integer) variables.

# Points
plot(x = gap$gdpPercap, y = gap$lifeExp, type = "p") 

# Connected lines (not a smoothing line)
plot(x = gap$gdpPercap, y = gap$lifeExp, type = "l") 

# Both
plot(x = gap$gdpPercap, y = gap$lifeExp, type = "b") 

Add a title, change the x- and y-axis labels and limits, and change the point size and shape. Type ?pch to learn more about point shapes in base plots.

# Turn off scientific notation
# options(scipen = 999)

plot(x = gap$gdpPercap, y = gap$lifeExp, 
     type = "p", 
     main = "Example scatterplot", 
     xlab = "GDP per capita income (USD)", 
     ylab = "Life Expectancy (years)", 
     xlim = c(0, 40000), 
     ylim = c(20, 90),
     cex = 2, pch = 6)
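
Base plotting can also color points by the levels of a factor, although it takes a few extra steps. The sketch below is one possible approach, using a toy data frame with made-up values in place of gap; the names toy, pal, and pt_col are illustrative.

```r
# Toy stand-in for the gap data frame (values are illustrative, not real rows)
toy <- data.frame(
  gdpPercap = c(800, 5000, 12000, 30000, 25000, 1500),
  lifeExp   = c(45, 62, 70, 79, 81, 50),
  continent = factor(c("Africa", "Americas", "Asia", "Europe", "Oceania", "Africa"))
)

# One color per factor level, in the order given by levels(toy$continent)
pal <- c("orange", "blue", "green", "red", "purple")

# as.integer() on a factor returns the level index, which picks the color
pt_col <- pal[as.integer(toy$continent)]

plot(x = toy$gdpPercap, y = toy$lifeExp,
     col = pt_col, pch = 19,
     xlab = "GDP per capita income (USD)",
     ylab = "Life Expectancy (years)")

# The legend must be built by hand from the same levels and colors
legend("bottomright", legend = levels(toy$continent), col = pal, pch = 19)
```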

3 The ggplot2 way

Base plotting is just fine, but it takes some slightly complicated code to map colors and point shapes to each of the five continents, and adding a legend gets even trickier. Thankfully, ggplot2 handles these complexities with ease, using more compact code inspired by Leland Wilkinson’s The Grammar of Graphics.

NOTE: ggplot2 is the name of the package, but ggplot() is the main function.

You need three things to make a ggplot:
1. Data
2. “aes”thetics: to define your x and y axes, map colors to factor levels, etc.
3. “geom_”s: the visual marks to represent your data - points, bars, lines, ribbons, polygons, etc.

One thing to remember is that ggplot2 works in layers, similar to photoimaging software such as Photoshop, Illustrator, Inkscape, GIMP, ImageJ, etc. We create a base layer, and then stack layers on top of that base layer. Add a new layer by typing a + symbol at the end of each line. Furthermore, “theme_”s are used to customize the non-data related aspects of your plots.
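
The layering idea can be sketched as follows, assuming a toy data frame with made-up values in place of gap. The object names are illustrative, and counting entries in the plot's layers element is only used here as a way to see the stack grow.

```r
library(ggplot2)

# Toy data standing in for gap (values are illustrative only)
toy <- data.frame(
  gdpPercap = c(800, 1500, 5000, 12000, 25000, 30000),
  lifeExp   = c(45, 50, 62, 70, 81, 79)
)

# Base layer: data and aesthetics, but no geoms yet
base_layer <- ggplot(data = toy, aes(x = gdpPercap, y = lifeExp))

# Each + adds one more layer on top of what is already there
with_points <- base_layer + geom_point()
with_line <- with_points + geom_line()

# Number of geom layers stored in each plot object
length(base_layer$layers)
length(with_points$layers)
length(with_line$layers)
```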

3.0.1 gg Histogram

3.0.1.0.1 Define the base layer

Pass in two arguments to the ggplot function to construct the base layer: the data and the global aesthetics (the ones that apply to all layers of the plot) defined within aes(). We see our coordinate system, but no data!

library(ggplot2)
ggplot(data = gap, aes(x = lifeExp))

Add your “geom_” to see the data!

ggplot(data = gap, aes(x = lifeExp)) + 
  geom_histogram(color = "orange", fill = "green")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Ahh, my eyes! Always avoid chartjunk! Keep your visualizations simple and crisp so that they can efficiently communicate their point without losing your audience in chartjunk.

3.0.1.0.2 theme_

In ggplot2, themes control the non-data parts of a plot, such as the background, gridlines, and legends. One way to improve the background of your figure is to add a theme_ layer:

ggplot(data = gap, aes(x = lifeExp)) + 
  geom_histogram(color = "black", 
                 fill = "gray80", 
                 bins = 10) + 
  theme_bw()

Add a title and change x and y axis labels similar to before - but note the syntax differences of each layer compared to base plotting arguments from earlier

ggplot(data = gap, aes(x = lifeExp)) + 
  geom_histogram(color = "black", 
                 fill = "gray80", 
                 bins = 10) + 
  theme_bw() + 
  ggtitle("Histogram of Life Expectancy") + 
  xlab("Years") + 
  ylab("Frequency") 
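
As an alternative, the title and both axis labels can be set in a single labs() call. This is a minimal sketch on made-up values, not the gapminder data; the names toy and p_labs are illustrative.

```r
library(ggplot2)

# Made-up life expectancy values, for illustration only
toy <- data.frame(lifeExp = c(29, 35, 42, 48, 52, 58, 61, 66, 71, 74, 79, 83))

p_labs <- ggplot(data = toy, aes(x = lifeExp)) +
  geom_histogram(color = "black", fill = "gray80", bins = 5) +
  theme_bw() +
  # labs() sets the title and axis labels together
  labs(title = "Histogram of Life Expectancy",
       x = "Years",
       y = "Frequency")

p_labs
```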

We can also assign this visualization to a variable for later use

lifeExp_hist = ggplot(data = gap, aes(x = lifeExp)) + 
  geom_histogram(bins = 10, 
                 fill = "green", 
                 color = "black") + 
  theme_bw() + 
  ggtitle("Histogram of Life Expectancy") + 
  xlab("Years") + 
  ylab("Frequency") 

# Call it to view
lifeExp_hist

4 Challenge 1

Open “ggplot2_challenges.Rmd”. Create a gg histogram using the heart dataset. Save it in a variable named A

4.0.1 gg Boxplot

ggplot boxplots are similar to base boxplots, but the helpful additions and customizations are easier to understand and define. Make boxplots of lifeExp for the five continents. What has fill = continent done!?

What do you think is the difference between fill and color?

ggplot(data = gap, aes(x = continent, y = lifeExp, fill = continent)) + 
  geom_boxplot() + 
  theme_minimal()
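
One way to explore the difference is to draw the same boxplots twice, once with fill and once with color mapped to continent, and compare them. The sketch assumes a toy data frame with made-up values; p_fill and p_color are illustrative names.

```r
library(ggplot2)

# Toy data with two continents (values are illustrative only)
toy <- data.frame(
  continent = rep(c("Africa", "Europe"), each = 5),
  lifeExp   = c(40, 45, 50, 55, 60, 68, 72, 75, 78, 81)
)

# fill controls the interior of each box
p_fill <- ggplot(toy, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  theme_minimal()

# color controls the outlines, whiskers, and outlier points
p_color <- ggplot(toy, aes(x = continent, y = lifeExp, color = continent)) +
  geom_boxplot() +
  theme_minimal()

p_fill
p_color
```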

5 Challenge 2

Open “ggplot2_challenges.Rmd”. Create gg boxplots using the heart dataset. Save this figure in a variable named B

5.0.1 gg Scatterplot

ggplot scatterplots are similar to base scatterplots, again with easier feature customization. Make a scatterplot of lifeExp by gdpPercap. What has color = continent done!?

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point() + 
  theme_test()

The legend can be moved around by adding the legend.position argument to a theme layer

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point() + 
  theme(legend.position = "right")

A theme layer is also helpful for manipulating the text of the axis labels

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point() + 
  theme_bw() + # Does it still work if you add this theme after the other theme? 
  theme(legend.position = "top", 
        axis.text.x = element_text(angle = 45, hjust = 1)) 
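
The question in the comment above can be checked directly. A complete theme such as theme_bw() replaces theme settings made before it, so the order of the two layers matters. The sketch below uses a toy data frame with made-up values and illustrative object names, and inspects the stored legend position rather than relying on the picture.

```r
library(ggplot2)

# Toy data standing in for gap (values are illustrative only)
toy <- data.frame(
  gdpPercap = c(800, 5000, 30000, 25000),
  lifeExp   = c(45, 62, 79, 81),
  continent = c("Africa", "Americas", "Europe", "Oceania")
)

base_plot <- ggplot(toy, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()

# Tweak first, complete theme second: theme_bw() replaces the tweak
tweak_then_bw <- base_plot + theme(legend.position = "top") + theme_bw()

# Complete theme first, tweak second: the tweak is kept
bw_then_tweak <- base_plot + theme_bw() + theme(legend.position = "top")

tweak_then_bw$theme$legend.position
bw_then_tweak$theme$legend.position
```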

5.0.1.0.1 Custom scale breaks

To set custom breaks, add a scale layer. Use seq() to generate breaks that run from a start point to an end point at a fixed interval:

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point() + 
  theme_bw() + 
  theme(legend.position = "top", 
        axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_x_continuous(breaks = seq(from = 0, to = 120000, by = 20000), limits = c(0, 120000)) + 
  scale_y_continuous(breaks = seq(from = 20, to = 90, by = 10), limits = c(20, 90))
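
To see what the scale layers receive, the seq() calls can be run on their own. This small sketch only evaluates the same calls used above; x_breaks and y_breaks are illustrative names.

```r
# Breaks for the x-axis: every 20000 from 0 to 120000
x_breaks <- seq(from = 0, to = 120000, by = 20000)

# Breaks for the y-axis: every 10 from 20 to 90
y_breaks <- seq(from = 20, to = 90, by = 10)

x_breaks
y_breaks
```

Keep in mind that limits in a continuous scale drop any observations that fall outside the range, so it is worth choosing limits that cover your data.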

5.0.1.0.2 Point sizes, shapes, transparencies

Change point sizes, shapes, and transparencies

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, 
                       color = continent, 
                       # size = 29, 
                       shape = continent)) + 
  geom_point(alpha = 0.25, size = 6) + 
  theme_bw() + # Does it still work if you add this theme after the other theme? 
  theme(legend.position = "top", 
        axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_x_continuous(breaks = seq(from = 0, to = 120000, by = 20000), limits = c(0, 120000)) + 
  scale_y_continuous(breaks = seq(from = 0, to = 90, by = 10), limits = c(20, 90))

5.0.1.0.3 Log transform axis

Alternatively, you can log transform an axis as well …

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, 
                       color = continent, 
                       shape = continent)) + 
  # size is a constant, so set it in the geom rather than inside aes()
  geom_point(alpha = 0.55, size = 2) + 
  theme_bw() + # Does it still work if you add this theme after the other theme? 
  theme(legend.position = "top", 
        axis.text.x = element_text(angle = 45, hjust = 1)) + 
  # scale_x_continuous(breaks = seq(from = 0, to = 120000, by = 20000), limits = c(0, 120000)) + 
  scale_x_log10() + 
  scale_y_continuous(breaks = seq(from = 0, to = 90, by = 10), limits = c(20, 90))

… and add smoothing lines

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, 
                       color = continent, 
                       shape = continent)) + 
  # size is a constant, so set it in the geom rather than inside aes()
  geom_point(alpha = 0.25, size = 2) + 
  theme_bw() + # Does it still work if you add this theme after the other theme? 
  theme(legend.position = "top", 
        axis.text.x = element_text(angle = 45, hjust = 1)) + 
  # scale_x_continuous(breaks = seq(from = 0, to = 120000, by = 20000), limits = c(0, 120000)) + 
  scale_x_log10() + 
  scale_y_continuous(breaks = seq(from = 0, to = 90, by = 10), limits = c(20, 90)) + 
  geom_smooth(method = "lm", se = TRUE, lwd = 1)
## `geom_smooth()` using formula 'y ~ x'

Save as a variable for later …

options(scipen = 999)
gdpLe_scatter = ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, 
                       color = continent, 
                       shape = continent)) + 
  # size is a constant, so set it in the geom rather than inside aes()
  geom_point(alpha = 0.25, size = 2) + 
  theme_bw() + # Does it still work if you add this theme after the other theme? 
  theme(legend.position = "top", 
        axis.text.x = element_text(angle = 45, hjust = 1)) + 
  # scale_x_continuous(breaks = seq(from = 0, to = 120000, by = 20000), limits = c(0, 120000)) + 
  scale_x_log10() + 
  scale_y_continuous(breaks = seq(from = 0, to = 90, by = 10), limits = c(20, 90)) + 
  geom_smooth(method = "lm", se = TRUE, lwd = 1)

gdpLe_scatter
## `geom_smooth()` using formula 'y ~ x'

6 Challenge 3

Open “ggplot2_challenges.Rmd”. Create a gg scatterplot using the heart dataset and save it in a variable named C.

6.0.1 gg Lineplot

Lineplots are useful for visualizing change in some variable on the y-axis plotted against time on the x-axis. There are many different ways to do this, including reshaping your data using the reshape2 or tidyr packages.

We will use a quick dplyr review to add a column to our gap dataset containing the mean lifeExp for each continent in each year. Check out D-Lab’s Data Wrangling and Manipulation in R to learn more!

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gap_lifeExp_mean = gap %>%
  dplyr::group_by(year, continent) %>%
  dplyr::mutate(mean_lifeExp = mean(lifeExp))

head(gap_lifeExp_mean)
## # A tibble: 6 x 7
## # Groups:   year, continent [6]
##   country      year      pop continent lifeExp gdpPercap mean_lifeExp
##   <fct>       <int>    <dbl> <fct>       <dbl>     <dbl>        <dbl>
## 1 Afghanistan  1952  8425333 Asia         28.8      779.         46.3
## 2 Afghanistan  1957  9240934 Asia         30.3      821.         49.3
## 3 Afghanistan  1962 10267083 Asia         32.0      853.         51.6
## 4 Afghanistan  1967 11537966 Asia         34.0      836.         54.7
## 5 Afghanistan  1972 13079460 Asia         36.1      740.         57.3
## 6 Afghanistan  1977 14880372 Asia         38.4      786.         59.6
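
Because mutate() keeps every country row, each continent-year mean is repeated once per country. If you only need one row per continent and year, summarise() is a possible alternative. The sketch below assumes a toy data frame with made-up values; toy and toy_means are illustrative names.

```r
library(dplyr)

# Toy data: two Asian countries and one European country (made-up values)
toy <- data.frame(
  country   = c("A", "B", "A", "B", "C", "C"),
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 1952, 1957, 1957, 1952, 1957),
  lifeExp   = c(30, 40, 32, 44, 65, 67)
)

# summarise() collapses each year-continent group to a single row
toy_means <- toy %>%
  group_by(year, continent) %>%
  summarise(mean_lifeExp = mean(lifeExp), .groups = "drop")

toy_means
```

Either form works for the lineplot; the summarised version simply has fewer rows to draw.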

Plot!

ggplot(gap_lifeExp_mean, aes(x = year, y = mean_lifeExp, 
                             color = continent, 
                             linetype = continent)) + 
  geom_line(lwd = 3) + 
  theme_bw() + 
  theme(legend.position = "top")

Increase legend size using theme and change legend title

ggplot(gap_lifeExp_mean, aes(x = year, y = mean_lifeExp, 
                             color = continent, 
                             linetype = continent)) + 
  geom_line(lwd = 2) + 
  theme_bw() + 
  theme(legend.position = "top", 
        legend.title = element_text(color = "black", size = 12, face = "bold"), 
        legend.text = element_text(color = "black", size = 12, face = "bold")) + 
  guides(color = guide_legend(title = "Continent:   "), 
         linetype = "none")

Or:
- remove the legend title
- increase the size of the legend lines
- increase the spacing of the legend items
- right align the legend text
- move labels to left of glyphs
- save as a variable

gap_line = ggplot(gap_lifeExp_mean, aes(x = year, y = mean_lifeExp, 
                             color = continent, 
                             linetype = continent)) + 
  geom_line(lwd = 2) + 
  theme_bw() + 
  theme(legend.position = "right", 
        legend.title = element_blank(), 
        legend.text = element_text(color = "black", size = 10, face = "bold"), 
        legend.key.width = unit(2.54, "cm"),
        legend.text.align = 1,
        legend.key = element_rect(size = 3, fill = "white", colour = NA), legend.key.size = unit(1, "cm")) + 
  guides(color = guide_legend(label.position = "left"))

gap_line

6.0.1.0.1 Faceting

You can also facet your plots to turn overlaid figures into separate ones. For example:

gap_line = gap_line + 
  facet_wrap(~continent) + 
  guides(linetype = "none") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
gap_line

7 Challenge 4

Open “ggplot2_challenges.Rmd”. Create a gg lineplot using the heart dataset and save it in a variable named D.

7.0.1 gg Heatmap

Heatmaps are useful when you want to plot three variables - one continuous variable by two factors.

  • Add a gradient scale fill based on lifeExp (darker colors represent more years)
  • Put years to left of legend glyphs
  • Save it as a variable

heat = ggplot(gap, aes(x = continent, y = year, fill = lifeExp)) + 
  geom_tile() + 
  scale_fill_gradient(low = "white", high = "gray20", 
                      limits = c(20,90), breaks = seq(20, 90, 10)) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_y_continuous(breaks = seq(from = 1952, to = 2007, by = 5), limits = c(1947, 2012)) + 
  guides(fill = guide_colourbar(label.position = "left"))

heat
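
Note that gap has many countries for each continent-year pair, so several tiles are drawn on top of one another and only one value per cell is visible. One hedged alternative is to average first, so that each tile shows a single summary value. The sketch uses a toy data frame with made-up values; toy, toy_means, and heat_means are illustrative names.

```r
library(ggplot2)

# Toy data with two countries per Asian continent-year pair (made-up values)
toy <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Asia", "Europe", "Europe"),
  year      = c(1952, 1952, 1957, 1957, 1952, 1957),
  lifeExp   = c(30, 40, 32, 44, 65, 67)
)

# One mean per continent-year pair, so each tile has exactly one value
toy_means <- aggregate(lifeExp ~ continent + year, data = toy, FUN = mean)

heat_means <- ggplot(toy_means, aes(x = continent, y = year, fill = lifeExp)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "gray20") +
  theme_bw()

heat_means
```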

8 cowplot for compound figures!

Combine figures into a single compound figure

library(cowplot)
compound = plot_grid(lifeExp_hist, gdpLe_scatter, gap_line, heat, 
                     nrow = 2, ncol = 2,
                     scale = 0.85, 
                     labels = c("A)", "B)", "C)", "D)"))
## `geom_smooth()` using formula 'y ~ x'
compound
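
A smaller version of the same idea, assuming two toy plots built from made-up values, is sketched below. The names p_points, p_line, and compound_toy are illustrative. Passing labels = "AUTO" asks cowplot to generate uppercase panel labels instead of typing them by hand.

```r
library(ggplot2)
library(cowplot)

# Toy data standing in for gap (values are illustrative only)
toy <- data.frame(
  year    = c(1952, 1957, 1962, 1967),
  lifeExp = c(45, 48, 52, 55)
)

p_points <- ggplot(toy, aes(x = year, y = lifeExp)) + geom_point() + theme_bw()
p_line   <- ggplot(toy, aes(x = year, y = lifeExp)) + geom_line() + theme_bw()

# Two panels side by side, labelled A and B automatically
compound_toy <- plot_grid(p_points, p_line,
                          nrow = 1, ncol = 2,
                          labels = "AUTO")

compound_toy
```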

9 Exporting

Exporting graphs in R is straightforward. Start by clicking the “Export” button in the RStudio Plots pane:

  1. Click Copy to clipboard… if you want to quickly copy/paste a figure into a slideshow presentation or text document.

  2. Click Save as image… (raster/bitmap formats such as .png, .jpeg, .tiff) if you want to export to one of these formats.

NOTE: Not recommended, because a raster image stores a fixed grid of pixels; not so great if you want to resize the image.

  3. Click Save as PDF… (vector formats such as .pdf, .ps) to export to .pdf.

NOTE: Recommended! A vector image stores each plot element as a drawing instruction rather than as pixels; great for resizing.

  • Export to .pdf
  • Open this .pdf file and click File –> Export As
  • Define the resolution
  • Save as .tiff format (with .jpeg compression if applicable)
  • This allows you to maintain resolution while shrinking the file size!

  4. Or, export with ggsave

# Our plot is stored in the object named compound (created above).
# The output folder ("compound/") is assumed to exist already.
ggsave(filename = "compound/compound.pdf", plot = compound, 
       width = 12, height = 8, units = "in", dpi = 600)
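
The sketch below shows the same ggsave() call in a form that can be run anywhere, assuming a toy plot built from made-up values and a temporary output folder. The names toy, p, out_dir, and pdf_path are illustrative.

```r
library(ggplot2)

# Toy plot standing in for the compound figure (values are illustrative only)
toy <- data.frame(
  year    = c(1952, 1957, 1962, 1967),
  lifeExp = c(45, 48, 52, 55)
)
p <- ggplot(toy, aes(x = year, y = lifeExp)) + geom_point() + theme_bw()

# Create the output folder first, here inside the session's temp directory
out_dir <- file.path(tempdir(), "figures")
dir.create(out_dir, showWarnings = FALSE)
pdf_path <- file.path(out_dir, "toy.pdf")

# The file extension decides the format; width and height set the page size
ggsave(filename = pdf_path, plot = p, width = 6, height = 4, units = "in")

file.exists(pdf_path)
```

For vector formats such as .pdf the page size is what matters; dpi mainly affects raster formats such as .png or .tiff.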

10 Challenge 5

Open “ggplot2_challenges.Rmd”. Use cowplot to create a compound figure named “compound_figure” that contains subplots A, B, C, and D that you created above.

11 Custom colors

Are you interested in custom colors for your plots? Check out the “Color and fill scales” section for discrete and continuous variables in the RStudio Visualization cheatsheet!
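
As a starting point, custom colors for a discrete variable can be supplied with scale_fill_manual() (or scale_color_manual() for the color aesthetic). This is a minimal sketch on a toy data frame with made-up values; the color choices and the names toy, continent_cols, and p_manual are illustrative.

```r
library(ggplot2)

# Toy data with two continents (values are illustrative only)
toy <- data.frame(
  continent = rep(c("Africa", "Europe"), each = 5),
  lifeExp   = c(40, 45, 50, 55, 60, 68, 72, 75, 78, 81)
)

# A named vector ties each factor level to a specific color
continent_cols <- c(Africa = "orange", Europe = "purple")

p_manual <- ggplot(toy, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  scale_fill_manual(values = continent_cols) +
  theme_bw()

p_manual
```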